Annex

Data Columns Detailed

Data summary
Name data
Number of rows 45896
Number of columns 26
_______________________
Column type frequency:
character 8
numeric 18
________________________
Group variables None

Variable type: character

skim_variable       n_missing  complete_rate  min  max  empty  n_unique  whitespace
Make                        0              1    3   34      0       141           0
Model                       0              1    1   47      0      4762           0
Fuel.Type.1                 0              1    6   17      0         6           0
Fuel.Type.2                 0              1    0   11  44059         5           0
Drive                       0              1    0   26   1186         8           0
Engine.Description          0              1    0   46  17031       590           0
Transmission                0              1    0   32     11        41           0
Vehicle.Class               0              1    4   34      0        34           0

Variable type: numeric

skim_variable                                    n_missing  complete_rate      mean        sd       p0       p25       p50       p75      p100  hist
ID                                                       0           1.00  23102.11  13403.10     1.00  11474.75  23090.50  34751.25  46332.00  ▇▇▇▇▇
Model.Year                                               0           1.00   2003.61     12.19  1984.00   1992.00   2005.00   2015.00   2023.00  ▇▆▆▇▇
Estimated.Annual.Petrolum.Consumption..Barrels.          0           1.00     15.33      4.34     0.05     12.94     14.88     17.50     42.50  ▁▇▃▁▁
City.MPG..Fuel.Type.1.                                   0           1.00     19.11     10.31     6.00     15.00     17.00     21.00    150.00  ▇▁▁▁▁
Highway.MPG..Fuel.Type.1.                                0           1.00     25.16      9.40     9.00     20.00     24.00     28.00    140.00  ▇▁▁▁▁
Combined.MPG..Fuel.Type.1.                               0           1.00     21.33      9.78     7.00     17.00     20.00     23.00    142.00  ▇▁▁▁▁
City.MPG..Fuel.Type.2.                                   0           1.00      0.85      6.47     0.00      0.00      0.00      0.00    145.00  ▇▁▁▁▁
Highway.MPG..Fuel.Type.2.                                0           1.00      1.00      6.55     0.00      0.00      0.00      0.00    121.00  ▇▁▁▁▁
Combined.MPG..Fuel.Type.2.                               0           1.00      0.90      6.43     0.00      0.00      0.00      0.00    133.00  ▇▁▁▁▁
Engine.Cylinders                                       487           0.99      5.71      1.77     2.00      4.00      6.00      6.00     16.00  ▇▇▅▁▁
Engine.Displacement                                    485           0.99      3.28      1.36     0.00      2.20      3.00      4.20      8.40  ▁▇▅▂▁
Time.to.Charge.EV..hours.at.120v.                        0           1.00      0.00      0.00     0.00      0.00      0.00      0.00      0.00  ▁▁▇▁▁
Time.to.Charge.EV..hours.at.240v.                        0           1.00      0.11      1.01     0.00      0.00      0.00      0.00     15.30  ▇▁▁▁▁
Range..for.EV.                                           0           1.00      2.36     24.97     0.00      0.00      0.00      0.00    520.00  ▇▁▁▁▁
City.Range..for.EV...Fuel.Type.1.                        0           1.00      1.62     20.89     0.00      0.00      0.00      0.00    520.80  ▇▁▁▁▁
City.Range..for.EV...Fuel.Type.2.                        0           1.00      0.17      2.73     0.00      0.00      0.00      0.00    135.28  ▇▁▁▁▁
Hwy.Range..for.EV...Fuel.Type.1.                         0           1.00      1.51     19.70     0.00      0.00      0.00      0.00    520.50  ▇▁▁▁▁
Hwy.Range..for.EV...Fuel.Type.2.                         0           1.00      0.16      2.46     0.00      0.00      0.00      0.00    114.76  ▇▁▁▁▁
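
The n_missing, complete_rate and empty columns above can be reproduced by hand. Doing so also shows why Fuel.Type.2 has a complete_rate of 1 despite 44059 empty entries: an empty string counts as a present value, not as NA. The base R sketch below illustrates this on a toy data frame; the values in `toy` are hypothetical and are not taken from the dataset.

```r
# Toy data frame with hypothetical values (not the report's data)
toy <- data.frame(
  Fuel.Type.2      = c("", "E85", "", ""),
  Engine.Cylinders = c(4, NA, 6, 8),
  stringsAsFactors = FALSE
)

# Missing values are NA entries only
n_missing <- colSums(is.na(toy))

# Share of non-missing entries per column
complete_rate <- 1 - n_missing / nrow(toy)

# Empty strings are counted separately, for character columns only
empty <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else NA_integer_
})

print(data.frame(n_missing, complete_rate, empty))
```

Applied to the full data, the same arithmetic gives 1 - 487/45896 ≈ 0.99 for Engine.Cylinders, which matches the table.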

Data summary

The table below provides an overview of the dataset.

Summary Statistics
Variable                                             N   Mean  Std. Dev.    Min  Pctl. 25  Pctl. 75    Max
ID                                               45896  23102      13403      1     11475     34751  46332
Model.Year                                       45896   2004         12   1984      1992      2015   2023
Estimated.Annual.Petrolum.Consumption..Barrels.  45896     15        4.3  0.047        13        18     43
Fuel.Type.1                                      45896
... Diesel                                        1254     3%
... Electricity                                    484     1%
... Midgrade Gasoline                              155     0%
... Natural Gas                                     60     0%
... Premium Gasoline                             14138    31%
... Regular Gasoline                             29805    65%
City.MPG..Fuel.Type.1.                           45896     19         10      6        15        21    150
Highway.MPG..Fuel.Type.1.                        45896     25        9.4      9        20        28    140
Combined.MPG..Fuel.Type.1.                       45896     21        9.8      7        17        23    142
Fuel.Type.2                                      45896
... (empty)                                      44059    96%
... E85                                           1513     3%
... Electricity                                    296     1%
... Natural Gas                                     20     0%
... Propane                                          8     0%
City.MPG..Fuel.Type.2.                           45896   0.85        6.5      0         0         0    145
Highway.MPG..Fuel.Type.2.                        45896      1        6.6      0         0         0    121
Combined.MPG..Fuel.Type.2.                       45896    0.9        6.4      0         0         0    133
Engine.Cylinders                                 45409    5.7        1.8      2         4         6     16
Engine.Displacement                              45411    3.3        1.4      0       2.2       4.2    8.4
Time.to.Charge.EV..hours.at.120v.                45896      0          0      0         0         0      0
Time.to.Charge.EV..hours.at.240v.                45896   0.11          1      0         0         0     15
Range..for.EV.                                   45896    2.4         25      0         0         0    520
City.Range..for.EV...Fuel.Type.1.                45896    1.6         21      0         0         0    521
City.Range..for.EV...Fuel.Type.2.                45896   0.17        2.7      0         0         0    135
Hwy.Range..for.EV...Fuel.Type.1.                 45896    1.5         20      0         0         0    520
Hwy.Range..for.EV...Fuel.Type.2.                 45896   0.16        2.5      0         0         0    115

Cleaned data overview

Cleaned Dataset

Name          Number_of_rows  Number_of_columns  Character  Numeric  Group_variables
data_cleaned           42240                 18          8        5  None

Cleaned and Reduced Dataset

Name                  Number_of_rows  Number_of_columns  Character  Numeric  Group_variables
data_cleaned_reduced           42061                 18          8        5  None

Eigenvalues for the Principal Components Analysis


Call:
PCA(X = data_prepared, graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
Variance               4.616   3.735   2.126   1.292   1.019   0.988   0.856
% of var.             25.644  20.748  11.809   7.177   5.660   5.490   4.753
Cumulative % of var.  25.644  46.392  58.201  65.378  71.038  76.527  81.280
                       Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
Variance               0.827   0.765   0.549   0.510   0.349   0.193   0.137
% of var.              4.594   4.250   3.047   2.834   1.941   1.071   0.759
Cumulative % of var.  85.875  90.124  93.172  96.005  97.946  99.017  99.777
                      Dim.15  Dim.16  Dim.17  Dim.18
Variance               0.027   0.008   0.003   0.002
% of var.              0.150   0.045   0.018   0.011
Cumulative % of var.  99.926  99.971  99.989 100.000
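
The percentages in this table follow directly from the eigenvalues. With FactoMineR's default scaling (`scale.unit = TRUE`), the eigenvalues sum to the number of variables, here 18, so each percentage is the eigenvalue divided by that total. The sketch below recomputes the percentages from the rounded eigenvalues printed above. It also counts the components with an eigenvalue above 1 (the Kaiser rule), which is used here only as an illustrative rule of thumb and is not a criterion stated in the report.

```r
# Eigenvalues copied from the table above (rounded to 3 decimals)
eig <- c(4.616, 3.735, 2.126, 1.292, 1.019, 0.988, 0.856, 0.827, 0.765,
         0.549, 0.510, 0.349, 0.193, 0.137, 0.027, 0.008, 0.003, 0.002)

pct_var <- 100 * eig / sum(eig)   # share of total variance per component
cum_var <- cumsum(pct_var)        # cumulative share

# Kaiser rule of thumb: count components with eigenvalue > 1
n_kaiser <- sum(eig > 1)

print(round(head(data.frame(eig, pct_var, cum_var), 5), 2))
print(n_kaiser)
```

Because the inputs are rounded, the recomputed percentages can differ from the printed ones in the last decimal. The first three components together account for about 58% of the variance.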

Individuals (the 10 first)
                                 Dist    Dim.1    ctr   cos2    Dim.2    ctr
1                            |  3.334 | -1.087  0.001  0.106 |  0.065  0.000
2                            |  3.387 | -0.907  0.000  0.072 | -0.079  0.000
3                            |  3.522 | -1.153  0.001  0.107 | -0.031  0.000
4                            |  2.761 | -1.068  0.001  0.150 |  0.012  0.000
5                            |  2.713 | -0.991  0.001  0.133 | -0.022  0.000
6                            |  2.713 | -0.991  0.001  0.133 | -0.022  0.000
7                            |  2.847 | -1.039  0.001  0.133 | -0.066  0.000
8                            |  2.876 | -1.151  0.001  0.160 | -0.019  0.000
9                            |  4.850 | -1.810  0.002  0.139 |  0.346  0.000
10                           |  3.313 | -1.686  0.001  0.259 |  0.231  0.000
                               cos2    Dim.3    ctr   cos2  
1                             0.000 |  0.018  0.000  0.000 |
2                             0.001 |  2.589  0.007  0.584 |
3                             0.000 |  2.603  0.008  0.546 |
4                             0.000 |  0.658  0.000  0.057 |
5                             0.000 |  0.609  0.000  0.050 |
6                             0.000 |  0.609  0.000  0.050 |
7                             0.001 |  0.490  0.000  0.030 |
8                             0.000 |  0.546  0.000  0.036 |
9                             0.005 | -0.005  0.000  0.000 |
10                            0.005 |  1.551  0.003  0.219 |

Variables (the 10 first)
                                Dim.1    ctr   cos2    Dim.2    ctr   cos2  
make                         |  0.106  0.242  0.011 | -0.092  0.229  0.009 |
model_year                   |  0.347  2.605  0.120 |  0.032  0.028  0.001 |
vehicle_class                | -0.141  0.428  0.020 |  0.051  0.071  0.003 |
drive                        | -0.038  0.031  0.001 |  0.003  0.000  0.000 |
engine_cylinders             |  0.077  0.130  0.006 | -0.073  0.141  0.005 |
engine_displacement          |  0.125  0.338  0.016 | -0.104  0.292  0.011 |
transmission                 | -0.498  5.376  0.248 |  0.069  0.128  0.005 |
fuel_type_1                  | -0.419  3.802  0.175 |  0.223  1.328  0.050 |
city_mpg_fuel_type_1         |  0.837 15.173  0.700 | -0.308  2.532  0.095 |
highway_mpg_fuel_type_1      |  0.800 13.860  0.640 | -0.313  2.630  0.098 |
                              Dim.3    ctr   cos2  
make                         -0.472 10.475  0.223 |
model_year                   -0.120  0.679  0.014 |
vehicle_class                 0.474 10.568  0.225 |
drive                         0.158  1.172  0.025 |
engine_cylinders              0.755 26.797  0.570 |
engine_displacement           0.851 34.040  0.724 |
transmission                 -0.018  0.016  0.000 |
fuel_type_1                  -0.170  1.358  0.029 |
city_mpg_fuel_type_1         -0.242  2.753  0.059 |
highway_mpg_fuel_type_1      -0.351  5.790  0.123 |
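
For a standardized PCA, the three values reported for each variable on a component are linked. The coordinate is the correlation between the variable and the component, cos2 is its square, and ctr is that square expressed as a percentage of the component's eigenvalue. As a check against the rounded values printed above, writing r for the coordinate of variable j on component k and using the Dim.3 eigenvalue of 2.126:

```latex
\[
\cos^2_{j,k} = r_{j,k}^{2}, \qquad
\mathrm{ctr}_{j,k} = 100 \times \frac{r_{j,k}^{2}}{\lambda_k}
\]
\[
\text{engine\_displacement, Dim.3:}\quad
0.851^{2} \approx 0.724, \qquad
100 \times \frac{0.724}{2.126} \approx 34.1
\]
```

This is close to the reported contribution of 34.040; the small gap comes from rounding in the printed coordinates and eigenvalues.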

3D Biplot for 6 clusters

Warning in PCA(data_prepared, graph = FALSE): Missing values are imputed by the
mean of the variable: you should use the imputePCA function of the missMDA
package

Based on the silhouette plot in the unsupervised learning part, we also provide a 3D biplot for 6 clusters, since the elbow plot suggests that 6 clusters is a second plausible choice. The biplot shows that the observations can indeed be divided into 6 clusters. Compared with the 3D biplot in the ‘results_unsupervised_learning’ part, cluster 2 splits into four smaller clusters, which indicates that this cluster is heterogeneous when only 3 clusters are used. With 6 clusters, however, these 4 sub-clusters are more difficult to interpret. The split also explains the second elbow in the elbow plot: the first elbow occurs at 3 clusters, and the curve drops steeply again between 5 and 6 clusters. Moving to 4 or 5 clusters would therefore bring little benefit, whereas a 6th cluster captures additional structure. The 3-cluster solution nevertheless remains meaningful and makes our clustering analysis more interpretable than the 6-cluster solution, which is why we retained only 3 clusters for our analysis.
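
The elbow and silhouette reasoning above can be sketched as follows. This is a minimal illustration on simulated data and assumes k-means clustering; the clustering method, data and dimensions used in the report may differ. `scores` is a hypothetical stand-in for the PCA coordinates, and the `cluster` package, a recommended package shipped with standard R installations, supplies `silhouette()`.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Hypothetical stand-in for PCA scores: three well-separated 3-D groups
scores <- rbind(
  matrix(rnorm(150, mean = 0, sd = 0.5), ncol = 3),
  matrix(rnorm(150, mean = 4, sd = 0.5), ncol = 3),
  matrix(rnorm(150, mean = 8, sd = 0.5), ncol = 3)
)

ks <- 1:8

# Elbow method: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) {
  kmeans(scores, centers = k, nstart = 20)$tot.withinss
})

# Improvement gained by each additional cluster (k -> k + 1)
gain <- -diff(wss)

# Average silhouette width for k >= 2 (undefined for a single cluster)
d <- dist(scores)
avg_sil <- sapply(ks[-1], function(k) {
  cl <- kmeans(scores, centers = k, nstart = 20)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

best_k <- ks[-1][which.max(avg_sil)]
print(round(wss, 1))
print(best_k)
```

In this toy example `gain` collapses after k = 3, which is the elbow. A second large value of `gain` at a higher k, like the drop the report observes between 5 and 6 clusters, would appear as a second elbow.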